Believe it or not, despite all of the complexity under the hood, fitting a linear model in R with least squares is quite simple with a straightfoward workflow.
Let’s go through each step with an example of seals. Are older seals larger?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
seals <- read.csv("./data/17e8ShrinkingSeals Trites 1996.csv")
seal_base <- ggplot(seals, aes(x=age.days, y=length.cm)) +
geom_point() +
theme_grey(base_size=14)
seal_base
Neat data set, no?
Now, looking at this from the get-go, we can see it’s likely nonlinear. Maybe even non-normal! Let’s ignore that for now, as it will make the results great fodder for our diagnostics!
OK, so, you have the data. And in your model, you want to see how age is a predictor of length. How do you fit it?
Nothing could be simpler. R has a function called lm() which stands for linear model. It works just like base plot, with the y ~ x syntax. And we create a fit model object.
seal_lm <- lm(length.cm ~ age.days, data=seals)
That’s it.
Now, if we want to just peak at the fit, before going forward, we can use coef() which is pretty standard across all R fit model functions.
coef(seal_lm)
## (Intercept) age.days
## 1.157668e+02 2.370559e-03
But that’s getting ahead of ourselves…
R also provides a 1-stop shop for evaluating functions. Fit model objects can typically be plotted. Now, it uses base plot, so, we’ll use the par function to setup a 2x2 plotting area.
par(mfrow = c(2,2)) #2 rows, 2 columns
plot(seal_lm)
par(mfrow=c(1,1)) #reset back to 1 panel
Whoah - that’s a lot! And, there’s no figure with Cook’s D or a histogram of residuals.
OK, breathe.
plot.lm() actualyl generates even more plots than shown here. You can specify what plot you want with the which argument, but will need to look at ?plot.lm to know just what to look at.
I have five plots I really like to look at - four of which plot.lm() will generate. Those four are the fitted versus residuals:
plot(seal_lm, which=1)
Note the curvature of the line? Troubling, or a high n?
A QQ plot of the residuals
plot(seal_lm, which=2)
Not bad!
The Cook’s D values
plot(seal_lm, which=4)
All quite small!
And last, leverage
plot(seal_lm, which=5)
I also like to look at the histogram of residuals.There is a function called residuals that will work on nearly any fit model object in R. So we can just…
hist(residuals(seal_lm))
Note, there’s also a library called modelr which can add the appropriate residuals to your data frame using a dplyr-like syntax.
library(modelr)
seals <- seals %>%
add_residuals(seal_lm)
head(seals)
## age.days length.cm resid
## 1 5337 131 2.5815746
## 2 2081 123 2.3001156
## 3 2085 122 1.2906334
## 4 4299 136 10.0422152
## 5 2861 122 -0.5489206
## 6 5052 131 3.2571840
Check out that new column. You can now plot your predictor versus residuals, which should show no trend, which you can use a spline with stat_smooth to see.
qplot(age.days, resid, data=seals) +
stat_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
And you can also add fitted values and look at fitted versus residual.
seals <- seals %>%
add_predictions(seal_lm)
qplot(pred, resid, data=seals) +
stat_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
OK, ok, everything looks fine. Now, how do we test our model.
First, F-tests! R has a base method called anova which - well, it’s for looking at analysis of partitioning variance, but really will take on a wide variety of forms as we go forward. For now, it will produce F tables for us
anova(seal_lm)
## Analysis of Variance Table
##
## Response: length.cm
## Df Sum Sq Mean Sq F value Pr(>F)
## age.days 1 90862 90862 2815.8 < 2.2e-16 ***
## Residuals 9663 311807 32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Boom! P values! And they are low. Simple, no?
For more information - t tests, Rsummary() - again, a function that is a go-to for nearly any fit model.
summary(seal_lm)
##
## Call:
## lm(formula = length.cm ~ age.days, data = seals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8700 -3.8279 0.0304 3.7541 21.6874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.158e+02 1.764e-01 656.37 <2e-16 ***
## age.days 2.371e-03 4.467e-05 53.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.681 on 9663 degrees of freedom
## Multiple R-squared: 0.2256, Adjusted R-squared: 0.2256
## F-statistic: 2816 on 1 and 9663 DF, p-value: < 2.2e-16
This is a lot of information to drink in - function call, distribution of residuals, coefficient t-tests, and multiple pieces of information about total fit.
We may want to get this information in a more condensed form for use in other contexts - particularly to compare against other models. For that, there’s a wonderful packages called broom that sweeps up your model into easy digestable pieces.
First, the coefficient table - let’s make it pretty.
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
tidy(seal_lm)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 116. 0.176 656. 0
## 2 age.days 0.00237 0.0000447 53.1 0
Nice.
We can also do this for the F-test.
tidy(anova(seal_lm))
## # A tibble: 2 x 6
## term df sumsq meansq statistic p.value
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 age.days 1 90862. 90862. 2816. 0
## 2 Residuals 9663 311807. 32.3 NA NA
If we want to get information about fit, there’s glance()
glance(seal_lm)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.226 0.226 5.68 2816. 0 2 -30502. 61009. 61031.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
Lovely! Now, how do we visualize the fit and fit prediction error?
In ggplot2 we can use the smoother, stat_smooth in conjunction with method = "lm" to get the job done.
seal_fit_plot <- ggplot(data=seals) +
aes(x=age.days, y=length.cm) +
geom_point() +
stat_smooth(method="lm")
seal_fit_plot
## `geom_smooth()` using formula 'y ~ x'
Fist, the relationship between how lean you are and how quickly you lose fat. Implement this to get a sense ot the general workflow for analysis
library(ggplot2)
fat <- read.csv("./data/17q04BodyFatHeatLoss Sloan and Keatinge 1973 replica.csv")
#initial visualization to determine if lm is appropriate
fat_plot <- ggplot(data=fat, aes(x=leanness, y=lossrate)) +
geom_point()
fat_plot
fat_mod <- lm(lossrate ~ leanness, data=fat)
#assumptions
plot(fat_mod, which=1)
plot(fat_mod, which=2)
#f-tests of model
anova(fat_mod)
#t-tests of parameters
summary(fat_mod)
#plot with line
fat_plot +
stat_smooth(method=lm, formula=y~x)
For your first faded example, let’s look at the relationship between DEET and mosquito bites.
deet <- read.csv("./data/17q24DEETMosquiteBites.csv")
deet_plot <- ggplot(data=___, aes(x=dose, y=bites)) +
geom_point()
deet_plot
deet_mod <- lm(bites ~ dose, data=deet)
#assumptions
plot(___, which=1)
plot(___, which=2)
#f-tests of model
anova(___)
#t-tests of parameters
summary(___)
#plot with line
deet_plot +
stat_smooth(method=lm, formula=y~x)
We’ll start today with the dataset 15e1KneesWhoSayNight.csv about an experiment to help resolve jetlag by having people shine lights at different parts of themselves to try and shift their internal clocks.
library(tidyverse)
## ── Attaching packages ───────── tidyverse 1.2.1 ──
## ✓ tibble 3.0.1 ✓ purrr 0.3.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## ── Conflicts ──────────── tidyverse_conflicts() ──
## x broom::bootstrap() masks modelr::bootstrap()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
knees <- read_csv("./data/15e1KneesWhoSayNight.csv")
## Parsed with column specification:
## cols(
## treatment = col_character(),
## shift = col_double()
## )
We can see the outcomes with ggplot2
library(ggplot2)
ggplot(knees, mapping=aes(x=treatment, y=shift)) +
stat_summary(color="red", size=1.3) +
geom_point(alpha=0.7) +
theme_bw(base_size=17)
## No summary function supplied, defaulting to `mean_se()`
As the underlying model of ANOVA is a linear one, we fit ANOVAs using lm() just as with linear regression.
knees <- read.csv("./data/15e1KneesWhoSayNight.csv")
knees_lm <- lm(shift ~ treatment, data=knees)
Now, there are two things to notice here. One, note that treatment is a either a character or factor. If it is not, because we are using lm(), it will be fit like a linear regression. So, beware!
There is an ANOVA-specific model fitting function - aov.
knees_aov <- aov(shift ~ treatment, data=knees)
It’s ok, I guess, and works with a few functions that lm() objects do not. But, in general, I find it too limiting. You can’t see coefficients, etc. Boooooring.
Because this is an lm, we can check our assumptions as before - with one new one. First, some oldies but goodies.
#The whole par thing lets me make a multi-panel plot
par(mfrow=c(2,2))
plot(knees_lm, which=c(1,2,5))
par(mfrow=c(1,1))
Now, the residuals v. fitted lets us see how the residuals are distributed by treatment, but I often find it insufficient, as spacing on the x-axis can get odd. I could roll my own plot of resudials versus treatment, but, there’s a wonderful package called car - which is from the book Companion to Applied Regression by John Fox. I recommend it highly! It has a function in it called residualPlots() which is useful here.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
residualPlots(knees_lm)
Note how it both does fitted v. residuals but also a boxplot by treatment. Handy, no?
OK, so, let’s see the ANOVA table! With the function….anova()!
anova(knees_lm)
## Analysis of Variance Table
##
## Response: shift
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 7.2245 3.6122 7.2894 0.004472 **
## Residuals 19 9.4153 0.4955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now….this is a type I sums of squares test. Which is fine for a 1-way ANOVA. If you want to start getting into the practice of using type II, car provides a function Anova() - note the capital A - which defaults to type II and I use instead. In fact, I use it all the time, as it handles a wide set of different models.
Anova(knees_lm)
## Anova Table (Type II tests)
##
## Response: shift
## Sum Sq Df F value Pr(>F)
## treatment 7.2245 2 7.2894 0.004472 **
## Residuals 9.4153 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here it matters not a whit as you get the same table.
So, there are a lot of things we can do with a fit model
summary(knees_lm)
##
## Call:
## lm(formula = shift ~ treatment, data = knees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27857 -0.36125 0.03857 0.61147 1.06571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.30875 0.24888 -1.241 0.22988
## treatmenteyes -1.24268 0.36433 -3.411 0.00293 **
## treatmentknee -0.02696 0.36433 -0.074 0.94178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared: 0.4342, Adjusted R-squared: 0.3746
## F-statistic: 7.289 on 2 and 19 DF, p-value: 0.004472
First, notice that we get the same information as a linear regression - including \(R^2\) and overall model F-test. THis is great. We also get coefficients, but, what do they mean?
Well, they are the treatment contrasts. Not super useful. R fits a model where treatment 1 is the intercept, and then we look at deviations from that initial treatment as your other coefficients. It’s efficient, but, hard to make sense of. To not get an intercept term, you need to refit the model without the intercept. You can fit a whole new model with -1 in the model formulation. Or, as I like to do to ensure I don’t frak anything up, you can update() your model. Just use . to signify what was there before.
knees_lm_no_int <- update(knees_lm, formula = . ~ . -1)
summary(knees_lm_no_int)
##
## Call:
## lm(formula = shift ~ treatment - 1, data = knees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27857 -0.36125 0.03857 0.61147 1.06571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## treatmentcontrol -0.3087 0.2489 -1.241 0.230
## treatmenteyes -1.5514 0.2661 -5.831 1.29e-05 ***
## treatmentknee -0.3357 0.2661 -1.262 0.222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared: 0.6615, Adjusted R-squared: 0.6081
## F-statistic: 12.38 on 3 and 19 DF, p-value: 0.0001021
OK - that makes more sense. For a 1-way ANOVA, we can also see treatment means using the emmeans package - much more on that next week (and later below for contrasts).
library(emmeans)
library(multcompView)
emmeans(knees_lm, ~treatment)
## treatment emmean SE df lower.CL upper.CL
## control -0.309 0.249 19 -0.830 0.212
## eyes -1.551 0.266 19 -2.108 -0.995
## knee -0.336 0.266 19 -0.893 0.221
##
## Confidence level used: 0.95
I also like this because it outputs CIs.
We see means and if they are different from 0. But….what about post-hoc tests
If you have a priori contrasts, you can use the constrat library to test them. You give contrast an a list and a b list. Then we get all comparisons of a v. b, in order. It’s not great syntactically, but, it lets you do some pretty creative things.
contrast::contrast(knees_lm,
a = list(treatment = "control"),
b = list(treatment = "eyes"))
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 1 1.242679 0.3643283 0.4801306 2.005227 3.41 19 0.0029
Meh. 9 times out of 10 we want to do something more like a Tukey Test. There is a TukeyHSD function that works on aov objects, but, if you’re doing anything with an lm, it borks on you. Instead, let’s use emmeans. It is wonderful as it’s designed to work with ANOVA and ANCOVA models with complicated structures such that, for post-hocs, it adjusts to the mean or median level of all other factors. Very handy.
knees_em <- emmeans(knees_lm, ~treatment)
contrast(knees_em,
method = "tukey")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.243 0.364 19 3.411 0.0079
## control - knee 0.027 0.364 19 0.074 0.9970
## eyes - knee -1.216 0.376 19 -3.231 0.0117
##
## P value adjustment: tukey method for comparing a family of 3 estimates
We don’t need to worry about many of the fancier things that emmeans does for the moment - those will become more useful with other models. But, we can look at this test a few different ways. First, we can visualize it
plot(contrast(knees_em,
method = "tukey")) +
geom_vline(xintercept = 0, color = "red", lty=2)
We can also, using our tukey method of adjustment, get “groups” - i.e., see which groups are statistically the same versus different.
library(multcompView)
CLD(knees_em, adjust="tukey")
## Warning: 'CLD' will be deprecated. Its use is discouraged.
## See '? CLD' for an explanation. Use 'pwpp' or 'multcomp::cld' instead.
## treatment emmean SE df lower.CL upper.CL .group
## eyes -1.551 0.266 19 -2.25 -0.855 1
## knee -0.336 0.266 19 -1.03 0.361 2
## control -0.309 0.249 19 -0.96 0.343 2
##
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 3 estimates
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05
This can be very useful in plotting. For example, we can use that output as a data frame for a ggplot in a few different ways.
CLD(knees_em, adjust="tukey") %>%
ggplot(aes(x = treatment, y = emmean,
ymin = lower.CL, ymax = upper.CL,
color = factor(.group))) +
geom_pointrange()
## Warning: 'CLD' will be deprecated. Its use is discouraged.
## See '? CLD' for an explanation. Use 'pwpp' or 'multcomp::cld' instead.
CLD(knees_em, adjust="tukey") %>%
mutate(.group = letters[as.numeric(.group)]) %>%
ggplot(aes(x = treatment, y = emmean,
ymin = lower.CL, ymax = upper.CL)) +
geom_pointrange() +
geom_text(mapping = aes(label = .group), y = rep(1, 3)) +
ylim(c(-2.5, 2))
## Warning: 'CLD' will be deprecated. Its use is discouraged.
## See '? CLD' for an explanation. Use 'pwpp' or 'multcomp::cld' instead.
knees_expanded <- left_join(knees, CLD(knees_em, adjust="tukey"))
## Warning: 'CLD' will be deprecated. Its use is discouraged.
## See '? CLD' for an explanation. Use 'pwpp' or 'multcomp::cld' instead.
## Joining, by = "treatment"
ggplot(knees_expanded,
aes(x = treatment, y = shift, color = .group)) +
geom_point()
We can similarly use this to look at a Dunnett’s test, which compares against the control
contrast(knees_em,
method = "dunnett")
## contrast estimate SE df t.ratio p.value
## eyes - control -1.243 0.364 19 -3.411 0.0057
## knee - control -0.027 0.364 19 -0.074 0.9927
##
## P value adjustment: dunnettx method for 2 tests
Note, if the “control” had not been the first treatment, you can either re-order the factor using forcats or just specify which of the levels is the control. For example, eyes is the second treatment. Let’s make it our new reference.
contrast(knees_em,
method = "dunnett", ref=2)
## contrast estimate SE df t.ratio p.value
## control - eyes 1.24 0.364 19 3.411 0.0057
## knee - eyes 1.22 0.376 19 3.231 0.0085
##
## P value adjustment: dunnettx method for 2 tests
You can even plot these results
plot(contrast(knees_em,
method = "dunnett", ref=2)) +
geom_vline(xintercept = 0, color = "red", lty=2)
Let’s say you wanted to do all pairwise tests, but, compare using a Bonferroni correction or FDR. Or none! No problem! There’s an adjust argument
contrast(knees_em,
method = "tukey", adjust="bonferroni")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.243 0.364 19 3.411 0.0088
## control - knee 0.027 0.364 19 0.074 1.0000
## eyes - knee -1.216 0.376 19 -3.231 0.0132
##
## P value adjustment: bonferroni method for 3 tests
contrast(knees_em,
method = "tukey", adjust="fdr")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.243 0.364 19 3.411 0.0066
## control - knee 0.027 0.364 19 0.074 0.9418
## eyes - knee -1.216 0.376 19 -3.231 0.0066
##
## P value adjustment: fdr method for 3 tests
contrast(knees_em,
method = "tukey", adjust="none")
## contrast estimate SE df t.ratio p.value
## control - eyes 1.243 0.364 19 3.411 0.0029
## control - knee 0.027 0.364 19 0.074 0.9418
## eyes - knee -1.216 0.376 19 -3.231 0.0044
Let’s try three ANOVAs! First - do landscape characteristics affect the number of generations plant species can exist before local extinction?
plants <- read.csv("./data/15q01PlantPopulationPersistence.csv")
#Visualize
qplot(treatment, generations, data=plants, geom="boxplot")
#fit
plant_lm <- lm(generations ~ treatment, data=plants)
#assumptions
plot(plant_lm, which=c(1,2,4,5))
#ANOVA
anova(plant_lm)
#Tukey's HSD
contrast(emmeans(plant_lm, ~treatment), method = "tukey")
Second, how do different host types affect nematode longevity?
worms <- read.csv("./data/15q19NematodeLifespan.csv")
#Visualize
qplot(treatment, lifespan, data=____, geom="____")
#fit
worm_lm <- lm(______ ~ ______, data=worms)
#assumptions
plot(______, which=c(1,2,4,5))
#ANOVA
anova(______)
#Tukey's HSD
contrast(emmeans(______, ~________), method = "tukey")
And last, how about how number of genotypes affect eelgrass productivity. Note, THERE IS A TRAP HERE. Look at your dataset before you do ANYTHING.
eelgrass <- read.csv("./data/15q05EelgrassGenotypes.csv")
#Visualize
________(treatment.genotypes, shoots, data=____, geom="____")
#fit
eelgrass_lm <- __(______ ~ ______, data=________)
#assumptions
________(______, which=c(1,2,4,5))
#ANOVA
________(______)
#Tukey's HSD
contrast(________(______, ~________), method = "________")